rm(list = ls())
library(data.table) #package for manipulating and computing on data
library(ggplot2)
library(ggdist)
library(colourpicker) #addin useful for selecting colours
library(cowplot) #theme for plotting
library(raincloudplots) #https://wellcomeopenresearch.org/articles/4-63
library(Lahman) #has data sets related to baseball (AllstarFull and Pitching)
library(palmerpenguins) #has data sets related to penguins (penguins)
AllstarFull from the Lahman package has baseball stats.
Pitching is another data set included with the Lahman package and is more comprehensive.
The palmerpenguins package has data sets related to penguins.
ggplot2 also has the data set midwest included with it. Load this with data("midwest", package = "ggplot2")
Copy the data and then convert the copy to a data.table
dat = copy(Pitching)
class(dat)
[1] "data.frame"
setDT(dat)
class(dat)
[1] "data.table" "data.frame"
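A minimal sketch of why the copy() step is there, using a small made-up data frame (toy_df and toy_dt are hypothetical names, not part of the Lahman data). setDT() converts its argument in place, so converting a copy leaves the original data frame untouched.

```r
library(data.table)

# Hypothetical toy data frame standing in for Pitching
toy_df <- data.frame(playerID = c("a", "b", "c"), H = c(10L, 20L, 30L))

toy_dt <- copy(toy_df)  # independent copy of the data frame
setDT(toy_dt)           # converts toy_dt to a data.table by reference

class(toy_df)  # the original is still a plain data.frame
class(toy_dt)  # the copy is now a data.table (and still a data.frame)
```

as.data.table(toy_df) would be another way to get a converted copy in a single step.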
examine data
head(dat)
str(dat)
Classes ‘data.table’ and 'data.frame': 48399 obs. of 30 variables:
$ playerID: chr "bechtge01" "brainas01" "fergubo01" "fishech01" ...
$ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
$ stint : int 1 1 1 1 1 1 1 1 1 1 ...
$ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 97 142 90 111 90 136 111 56 97 136 ...
$ lgID : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
$ W : int 1 12 0 4 0 0 0 6 18 12 ...
$ L : int 2 15 0 16 1 0 1 11 5 15 ...
$ G : int 3 30 1 24 1 1 3 19 25 29 ...
$ GS : int 3 30 0 24 1 0 1 19 25 29 ...
$ CG : int 2 30 0 22 1 0 1 19 25 28 ...
$ SHO : int 0 0 0 1 0 0 0 1 0 0 ...
$ SV : int 0 0 0 0 0 0 0 0 0 0 ...
$ IPouts : int 78 792 3 639 27 3 39 507 666 747 ...
$ H : int 43 361 8 295 20 1 20 261 285 430 ...
$ ER : int 23 132 3 103 10 0 5 97 113 153 ...
$ HR : int 0 4 0 3 0 0 0 5 3 4 ...
$ BB : int 11 37 0 31 3 0 3 21 40 75 ...
$ SO : int 1 13 0 15 0 0 1 17 15 12 ...
$ BAOpp : num NA NA NA NA NA NA NA NA NA NA ...
$ ERA : num 7.96 4.5 27 4.35 10 0 3.46 5.17 4.58 5.53 ...
$ IBB : int NA NA NA NA NA NA NA NA NA NA ...
$ WP : int 7 7 2 20 0 0 1 15 3 44 ...
$ HBP : int NA NA NA NA NA NA NA NA NA NA ...
$ BK : int 0 0 0 0 0 0 0 2 0 0 ...
$ BFP : int 146 1291 14 1080 57 3 70 876 1059 1334 ...
$ GF : int 0 0 0 1 0 1 1 0 0 0 ...
$ R : int 42 292 9 257 21 0 30 243 223 362 ...
$ SH : int NA NA NA NA NA NA NA NA NA NA ...
$ SF : int NA NA NA NA NA NA NA NA NA NA ...
$ GIDP : int NA NA NA NA NA NA NA NA NA NA ...
- attr(*, ".internal.selfref")=<externalptr>
playerID: Player ID code
yearID: Year
stint: player’s stint (order of appearances within a season)
teamID: Team (factor)
lgID: League ID, a factor with levels AA, AL, FL, NA, NL, PL, UA (NA here is the National Association league code, not a missing value)
W: Wins
L: Losses
G: Games
GS: Games Started
CG: Complete Games
SHO: Shutouts
SV: Saves
IPouts: Outs Pitched (innings pitched x 3)
H: Hits
ER: Earned Runs
HR: Homeruns
BB: Walks
SO: Strikeouts
BAOpp: Opponent’s Batting Average
ERA: Earned Run Average
IBB: Intentional Walks
WP: Wild Pitches
HBP: Batters Hit By Pitch
BK: Balks
BFP: Batters faced by Pitcher
GF: Games Finished
R: Runs Allowed
SH: Sacrifices by opposing batters
SF: Sacrifice flies by opposing batters
GIDP: Grounded into double plays by opposing batter
(compiled from: https://ourcodingclub.github.io/tutorials/datavis/)
geom
Geometric object which defines the type of graph you are making.
It reads your data in the aesthetics mapping to know which variables to use, and creates the graph accordingly.
Some common types are:
geom_point()
geom_boxplot()
geom_histogram()
geom_col()
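As a rough illustration of the geoms listed above, here is one small plot per geom. The mpg data set that ships with ggplot2 is used only as a stand-in, and the object names (p_point, p_box, p_hist, p_col, hwy_by_class) are made up for this sketch.

```r
library(ggplot2)

# mpg ships with ggplot2; it is used here only as a stand-in data set
p_point <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()    # scatter plot
p_box   <- ggplot(mpg, aes(x = class, y = hwy)) + geom_boxplot()  # one box per class
p_hist  <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20)  # distribution of one variable

# geom_col() draws bar heights taken directly from the data,
# so summarise first (mean hwy per class)
hwy_by_class <- aggregate(hwy ~ class, data = mpg, FUN = mean)
p_col <- ggplot(hwy_by_class, aes(x = class, y = hwy)) + geom_col()
```

Print any of the objects (e.g. p_box) to draw it.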
aes
Short for aesthetics.
Usually placed within ggplot() or a geom_, this is where you map variables in your data to the properties of the graph which depend on those variables (the data source itself is passed through the data argument).
For instance, if you want all data points to be the same colour, you would define the colour = argument outside the aes() function; if you want the data points to be coloured by a factor's levels (e.g. by site or species), you specify the colour = argument inside the aes().
Some common things to include in aes are:
x
y
fill
colour
size
shape
But note that different geoms have different aesthetics available (see cheatsheet below for example)
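The inside/outside aes() distinction can be sketched as follows. This again uses the mpg data set from ggplot2 as a stand-in, and p_fixed and p_mapped are hypothetical names.

```r
library(ggplot2)

# Outside aes(): one fixed colour for every point, no legend
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "steelblue")

# Inside aes(): colour is mapped to the levels of drv, and a legend is added
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv))
```

A common slip is writing aes(colour = "steelblue"), which treats the string as a one-level variable rather than as a colour.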
stat
A stat layer applies some statistical transformation to the underlying data: for instance, stat_smooth(method = "lm") displays a linear regression line and confidence interval ribbon on top of a scatter plot (defined with geom_point()).
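A small sketch of a stat layer at work, using the built-in mtcars data as a stand-in (p_lm is a made-up name). layer_data() shows the values the stat computed, which is a handy way to check what a stat is doing.

```r
library(ggplot2)

p_lm <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)  # fitted line plus confidence ribbon

# The stat computes new columns from the raw data; layer 2 is the smooth
head(layer_data(p_lm, 2))
```

The y column holds the fitted line and ymin/ymax hold the edges of the ribbon.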
theme
A set of visual parameters that control the background, borders, grid lines, axes, text size, legend position, etc.
You can use pre-defined themes (e.g., theme_cowplot() from the cowplot package), create your own, or use a predefined theme and overwrite only the elements you don’t like.
Examples of elements within themes are:
e.g., axis.text.y = element_text(size = 12)
e.g., axis.text.x = element_text(size = 12, angle = 45, vjust = 1, hjust = 1)
[makes the x labels at an angle]
e.g., axis.title = element_text(size = 14, face = "plain")
e.g., panel.grid = element_blank()
[Removes the background grid lines]
e.g., plot.margin = unit(c(1,1,1,1), units = "cm")
[Adds a 1cm margin around the plot]
e.g., legend.text = element_text(size = 12, face = "italic")
[Setting the font for the legend text]
e.g., legend.title = element_blank()
[Remove the legend title - useful as sometimes this is excessive and the default is to include it]
e.g., legend.position = c(0.9, 0.9)
[Places the legend inside the plot, near the top right corner]
+theme(axis.text.x = element_text(size = 12, angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_text(size = 12),
axis.title = element_text(size = 14, face = "plain"),
panel.grid = element_blank(),
plot.margin = unit(c(1,1,1,1), units = "cm"),
legend.text = element_text(size = 12, face = "italic"),
legend.title = element_blank(),
legend.position = c(0.9, 0.9))
You define their properties with element_…() functions. For example:
element_blank() would return something empty (ideal for removing background colour),
element_text(size = …, face = …, angle = …) lets you control all kinds of text properties.
#try a plot of hits (H) over year; note that home runs are in the HR column
ggplot(dat, aes(x=yearID, y=H))+geom_point()
#equivalent to
ggplot(dat)+geom_point(aes(x=yearID, y=H))
top tip: by wrapping the ggplot call in parentheses () you assign the plot to a variable and plot it at the same time. This is useful if you want to save the plot, make it into a figure, or refer to it later (e.g., replot it or put it in a panel with other figs). Example here using the same plot as above
(plot1 = ggplot(dat)+geom_point(aes(x=yearID, y=H)))
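A quick sketch of what having the plot stored in a variable buys you, using the built-in mtcars data as a stand-in. The names p_saved, p_themed and out_file are made up, and the temporary PDF path is only an example.

```r
library(ggplot2)

# Parentheses around an assignment print the value as well as storing it
(p_saved <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point())

# The stored object can be extended later...
p_themed <- p_saved + theme_bw()

# ...or written to disk (a temporary PDF here; the path is only an example)
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p_themed, width = 5, height = 4)
```

Stored plots can also be combined into one figure, for example with plot_grid() from cowplot.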
remove grey background with +theme_bw()
(plot1 = ggplot(dat)+geom_point(aes(x=yearID, y=H)) + theme_bw())
many other themes are available
(plot1 = ggplot(dat)+geom_point(aes(x=yearID, y=H)) + theme_classic())
(plot1 = ggplot(dat)+geom_point(aes(x=yearID, y=H)) + theme_minimal())
(plot1 = ggplot(dat)+geom_point(aes(x=yearID, y=H)) + theme_cowplot())
you can also create your own theme!
Just write it as a function. Example here taken from: https://rpubs.com/jenrichmond/W6LL
#library(data.table)
#library(palmerpenguins)
#library(cowplot)
#library(ggplot2)
theme_jen <- function () {
# define font up front
font <- "Helvetica"
# this theme uses theme_bw as the base
theme_bw() %+replace%
theme(
#get rid of grid lines/borders
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
# add white space top, right, bottom, left
plot.margin = unit(c(1, 1, 1, 1), "cm"),
# custom axis title/text/lines
axis.title = element_text(
family = font,
size = 14),
axis.text = element_text(
family = font,
size = 12),
# margin pulls text away from axis
axis.text.x = element_text(
margin=margin(5, b = 10)),
# black lines
axis.line = element_line(colour = "black", linewidth = rel(1)), # linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0)
# custom plot titles, subtitles, captions
plot.title = element_text(
family = font,
size = 18,
hjust = -0.1,
vjust = 4),
# custom plot subtitles
plot.subtitle = element_text(
family = font,
size = 14,
hjust = 0,
vjust = 3),
# custom captions
plot.caption = element_text(
family = font,
size = 10,
hjust = 1,
vjust = 2),
# custom legend
legend.title = element_text(
family = font,
size = 10,
hjust = 0),
legend.text = element_text(
family = font,
size = 8,
hjust = 0),
#no background on legend
legend.key = element_blank(),
# white background on facet strips
strip.background = element_rect(fill = "white",
colour = "black",
linewidth = rel(2)), # linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0)
complete = TRUE)
}
#source("theme_jen.R") # the script/function containing custom ggplot theme
(plot1 = ggplot(dat)+geom_point(aes(x=yearID, y=H)) + theme_jen())
add label to x and y axis plus add in various elements of theme
(plot1 = ggplot(dat)+geom_point(aes(x=yearID, y=H)) +
theme_classic()+
xlab('\nyear')+ # \n adds a blank line above the label
ylab('n hits')+ # H is hits allowed; home runs are in HR
theme(axis.text.x = element_text(size = 12, angle = 45, vjust = 1, hjust = 1), # making the years at a bit of an angle
axis.text.y = element_text(size = 12),
axis.title = element_text(size = 14, face = "plain"),
panel.grid = element_blank(),# Remove the background grid lines
plot.margin = unit(c(1,1,1,1), units = "cm"), # Add a 1cm margin around the plot
legend.text = element_text(size = 12, face = "italic"), # Setting the font for the legend text
legend.title = element_blank(), # Removing the legend title
legend.position = c(0.9, 0.9)))
might be cleaner to do the same plot on mean H per year
(plot1 = ggplot(dat[, .(H=mean(H)), by=yearID])+geom_point(aes(x=yearID, y=H)) + theme_classic()+
xlab('\nyear')+
ylab('mean hits per year')+
theme(axis.text.x = element_text(size = 12, angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_text(size = 12),
axis.title = element_text(size = 14, face = "plain"),
panel.grid = element_blank(),
plot.margin = unit(c(1,1,1,1), units = "cm"),
legend.text = element_text(size = 12, face = "italic"),
legend.title = element_blank(),
legend.position = c(0.9, 0.9)))
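The per-year summary relies on data.table's dt[i, j, by] form. A minimal sketch on a made-up table (toy and yearly are hypothetical names, and the numbers are invented so the group means are easy to check by eye):

```r
library(data.table)

# Hypothetical toy table: three pitcher-seasons in 1871, two in 1872
toy <- data.table(
  yearID = c(1871L, 1871L, 1871L, 1872L, 1872L),
  H      = c(10, 20, 30, 40, 60)
)

# dt[i, j, by]: compute j (the mean of H) once per group defined in by
yearly <- toy[, .(H = mean(H)), by = yearID]
yearly
```

The result has one row per year, which is why it can be passed straight to ggplot() as the data argument.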
add a linear trendline using geom_smooth. You have to specify the method for this (method="lm" or method=lm is fine). The standard error ribbon is added by default (add se=FALSE to disable this)
(plot1 = ggplot(dat[, .(H=mean(H)), by=yearID], aes(x=yearID, y=H))+
geom_point()+
geom_smooth(method=lm)+
theme_classic()+
xlab('\nyear')+
ylab('mean hits per year')+
theme(axis.text.x = element_text(size = 12, angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_text(size = 12),
axis.title = element_text(size = 14, face = "plain"),
panel.grid = element_blank(),
plot.margin = unit(c(1,1,1,1), units = "cm"),
legend.text = element_text(size = 12, face = "italic"),
legend.title = element_blank(),
legend.position = c(0.9, 0.9)))
you can also add a specific formula in geom_smooth (e.g., a cubic polynomial in x). In an R formula the powers must be wrapped in I() or written with poly(), otherwise x^2 and x^3 collapse to x, and method="lm" is needed so the formula is fitted as a linear model
(plot1 = ggplot(dat[, .(H=mean(H)), by=yearID], aes(x=yearID, y=H))+
geom_point()+
geom_smooth(method="lm", formula=y~poly(x, 3))+ # cubic polynomial in x
theme_classic()+
xlab('\nyear')+
ylab('mean hits per year')+
theme(axis.text.x = element_text(size = 12, angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_text(size = 12),
axis.title = element_text(size = 14, face = "plain"),
panel.grid = element_blank(),
plot.margin = unit(c(1,1,1,1), units = "cm"),
legend.text = element_text(size = 12, face = "italic"),
legend.title = element_blank(),
legend.position = c(0.9, 0.9)))
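A note on polynomial formulas, sketched with base R only on made-up data (x, y and the fit_ names are hypothetical). In R's formula syntax ^ means crossing of terms rather than arithmetic power, so a bare x^2 is the same term as x, and a polynomial has to be spelled with I() or poly().

```r
# Made-up data with a clear cubic shape (for illustration only)
set.seed(1)
x <- seq(-2, 2, length.out = 50)
y <- x^3 - x + rnorm(50, sd = 0.1)

# Without I(), ^ is formula crossing, so x^2 and x^3 collapse to x
fit_plain <- lm(y ~ x + x^2 + x^3)
names(coef(fit_plain))   # only the intercept and x

# I() protects the arithmetic; poly() is an equivalent, more compact spelling
fit_I    <- lm(y ~ x + I(x^2) + I(x^3))
fit_poly <- lm(y ~ poly(x, 3))
length(coef(fit_I))      # four coefficients: intercept, x, x^2, x^3
```

fit_I and fit_poly give the same fitted curve; poly() uses orthogonal polynomials by default, so only the individual coefficients differ.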
facet_wrap: this can be used to easily plot data in panels (e.g., plot
mean hits over time for each lgID; here I also distinguish
leagues by colour). Setting scales = "free_y" below means the y axis can
vary from plot to plot. You can also use nrow = or
ncol = to specify the number of rows/columns
dat[, yearIDfact := as.factor(yearID)] # add the column by reference, the data.table way
(plot1 = ggplot(dat[, .(H=mean(H)), by=.(lgID, yearID)], aes(x=yearID, y=H, colour=lgID))+
geom_point()+
facet_wrap(vars(lgID), scales = "free_y")+
theme_classic()+
xlab('\nyear')+ #\n adds blank line
ylab('mean hits per year'))
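A small sketch of the nrow = option, using the mpg data set from ggplot2 as a stand-in (p_wrap is a made-up name):

```r
library(ggplot2)

# One panel per vehicle class, forced into two rows;
# scales = "free_y" gives each panel its own y axis range
p_wrap <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(vars(class), nrow = 2, scales = "free_y") +
  theme_classic()
```

Swapping nrow = 2 for ncol = 2 would fix the number of columns instead and let the rows grow as needed.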
facet_grid does a similar thing but organised into columns or rows
here use rows based on lgID
(plot1 = ggplot(dat[, .(H=mean(H)), by=.(lgID, yearID)], aes(x=yearID, y=H, colour=lgID))+
geom_point()+
facet_grid(lgID ~ .)+
theme_classic()+
xlab('\nyear')+ #\n adds blank line
ylab('mean hits per year'))
columns based on lgID
(plot1 = ggplot(dat[, .(H=mean(H)), by=.(lgID, yearID)], aes(x=yearID, y=H, colour=lgID))+
geom_point()+
facet_grid(. ~ lgID)+
theme_classic()+
xlab('\nyear')+ #\n adds blank line
ylab('mean hits per year'))
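facet_grid can also split on two variables at once, one for rows and one for columns. A minimal sketch with the mpg data set from ggplot2 as a stand-in (p_grid is a made-up name):

```r
library(ggplot2)

# Rows split by drive train (drv), columns by number of cylinders (cyl)
p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl) +
  theme_classic()
```

Unlike facet_wrap, the grid keeps a panel for every row/column combination, so combinations with no data show up as empty panels.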
the power of stat_summary - avoids having to feed in summary data
ggplot(dat[yearID<1875, ], aes(x=yearID, y=H))+
stat_summary(fun.data = mean_se, geom="bar")+
stat_summary(fun.data = mean_se, geom="errorbar", width=0.5)+
theme_classic()
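To see what stat_summary is working with, it can help to call mean_se() directly. A small sketch on a made-up vector (vals and se are hypothetical names):

```r
library(ggplot2)

# mean_se() returns the mean (y) and the mean -/+ one standard error (ymin, ymax)
vals <- c(1, 2, 3)
mean_se(vals)

# The same numbers computed by hand
se <- sd(vals) / sqrt(length(vals))
c(y = mean(vals), ymin = mean(vals) - se, ymax = mean(vals) + se)
```

The bar geom only uses y, while the errorbar geom uses ymin and ymax, which is why the same summary function serves both layers.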
stat_summary also makes it super easy to superimpose the data points, as you can use the same data table
ggplot(dat[yearID<1875, ], aes(x=as.factor(yearID), y=H, fill=as.factor(yearID)))+
stat_summary(fun.data = mean_se, geom="bar", show.legend = FALSE)+
stat_summary(fun.data = mean_se, geom="errorbar", width=0.5)+
geom_jitter(width = 0.1, show.legend = FALSE, shape=21, colour="black")+
theme_classic()
ggplot(dat[yearID<1875, ], aes(x=as.factor(yearID), y=H, fill=as.factor(yearID)))+
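stat_summary(fun = median, fun.min = median, fun.max = median, geom="crossbar", show.legend = FALSE)+ # crossbar expects ymin and ymax, so all three are set to the median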
geom_jitter(width = 0.1, show.legend = FALSE, shape=21, colour="black")+
theme_classic()
ggplot(dat, aes(x=as.factor(yearID), y=H, fill=as.factor(yearID)))+
stat_summary(fun.data = mean_se, geom="bar", show.legend = FALSE)+
stat_summary(fun.data = mean_se, geom="errorbar", width=0.5)+
theme_classic()
ggplot(dat[yearID<1875], aes(x=yearID, y=H, fill=as.factor(yearID)))+
stat_summary(fun.data=mean_se, geom="bar")+
stat_summary(fun.data=mean_se, geom="errorbar", width=0.5)+
theme_cowplot()+
geom_jitter(width=0.1, shape=21)+
theme(axis.title = element_text(size=20), axis.text = element_text(size=20), legend.position = "none")+ylim(0,800)
A special subcategory as this is the most common plot I end up having to do.
Note on data wrangling
Non exhaustive list of links/resources I’ve used in the course of compiling this notebook
https://rpubs.com/jenrichmond/W6LL
https://rafalab.github.io/dsbook/ggplot2.html
http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html
https://ourcodingclub.github.io/tutorials/datavis/
https://ourcodingclub.github.io/tutorials/data-vis-2/
https://ourcodingclub.github.io/tutorials/qualitative/
This has some great videos https://www.youtube.com/c/RiffomonasProject